In this R markdown we analyze the Traumabase Registry to estimate the treatment effect of the drug Tranexamic Acid (TXA) on mortality among patients with traumatic brain injury (TBI).

For this analysis, an expert committee has identified 17 confounders available in the Traumabase: Trauma.center, Cardiac.arrest.ph, SBP.ph.min, DBP.ph.min, HR.ph.max, SBP.ph, DBP.ph, HR.ph, Shock.index.ph (HR.ph/SBP.ph), Cristalloid.volume, Colloid.volume, SpO2.ph.min, Vasopressor.therapy, HemoCue.init, Delta.hemoCue, AIS.external, Activation.HS.procedure.

1 Preliminaries

1.1 Load libraries

library(cobalt)
library(ggplot2)
library(dplyr)
library(reshape2)
library(tableone)
library(MASS)
library(pracma)
library(assertthat)

library(caret) # 
library(ranger)# Random Forests prediction
library(FactoMineR) # Factorial data analysis
library(grf) # Doubly robust treatment effect estimation

library(norm)
library(missMDA) # PCA/MCA with missing values + iterative PC imputation
library(mice) # Multiple imputation
library(VIM) # Missing values exploration and visualization

# Define data directory
data.dir <- "~/Documents/TraumaMatrix/CausalInference/AcideTranexamique/Data/"
rdata.dir <- "~/Documents/TraumaMatrix/CausalInference/Simulations/causal-inference-missing/TranexamicAcid/RData/"
fig.dir <- "~/Documents/TraumaMatrix/CausalInference/Simulations/causal-inference-missing/TranexamicAcid/Figures/"

library(devtools)
# Load the ATE estimation functions `ipw` and `dr` from the GitHub repository
source_url("https://raw.githubusercontent.com/imkemayer/causal-inference-missing/master/Helper/helper_causalInference.R")
## SHA-1 hash of file is 72aa0bae8732bb83bbb18ae5189baae2b7171c71
source_url("https://raw.githubusercontent.com/imkemayer/causal-inference-missing/master/Helper/helper_imputation.R")
## SHA-1 hash of file is 66fca1a6480553b69007d1f9896cd3097606fb98
save_results <- FALSE
tau <- NULL
# Set random generator seed for reproducible results
set.seed(0)

1.2 Load data

Run R markdown preprocess_ate_analysis_traumabase_example.Rmd prior to this to pre-process and save the data for the following analyses.

data_indTBI <- read.csv(paste0(data.dir, 
                               "ate_analysis_traumabase_large_data_preprocessed_tbi_individuals.csv"), 
                        row.names = 1)
seed <- 1234
synthetic <- FALSE

X.na <- data_indTBI[,setdiff(colnames(data_indTBI), 
                          c("TBI", "Tranexamic.acid", "Death"))]

W <- data_indTBI$Tranexamic.acid
Y <- data_indTBI$Death

covariate_names <- colnames(X.na)
n <- dim(X.na)[1]
p <- dim(X.na)[2]

df.na <- data.frame(cbind(X.na, W=W, Y=Y))
colnames(df.na) <- c(covariate_names, "W", "Y")

This pipeline requires W to be binary and coded either with {0,1} or with {FALSE,TRUE}, representing {control,treatment}.

assert_that(length(unique(W))==2)
## [1] TRUE
assert_that(unique(W) %in% c(0,1) || unique(W) %in% c(FALSE,TRUE))
## [1] TRUE

In certain cases, not all covariates are necessarily confounders. Therefore we specify the set of confounders in confounder_names. Additionally, in some cases there are also variables that are only predictive of the treatment assignment. Their names can be specified in only_treatment_pred_names.

confounder_names <- c("Trauma.center", 
                      "SBP.ph", "DBP.ph", "HR.ph", 
                      "SBP.ph.min", "DBP.ph.min", "HR.ph.max",  
                      "Cardiac.arrest.ph", "Vasopressor.therapy", "SpO2.ph.min",
                      "AIS.external", "HemoCue.init", "Delta.hemoCue", 
                      "Shock.index.ph", "Cristalloid.volume", "Colloid.volume", 
                      "Activation.HS.procedure")
only_treatment_pred_names <- c()
only_outcome_pred_names <- setdiff(covariate_names, 
                                   c(confounder_names, only_treatment_pred_names))

For the DR estimation, we can specify clusters, here we will use the Trauma centers as clusters.

cluster_names <- c("Trauma.center")

We list binary and other categorical variables to ensure correct type casts.

binary_names <- intersect(covariate_names,
                          c("Cardiac.arrest.ph", "Anticoagulant.therapy", 
                            "Antiplatelet.therapy", "EVD", "Activation.HS.procedure",
                            "Neurosurgery.day0", "Fi02", "Vasopressor.therapy"))

categorical_names <- intersect(covariate_names,
                               c("Trauma.center", "Pupil.anomaly", 
                                 "Osmotherapy", "Osmotherapy.ph",
                                 "Pupil.anomaly.ph",
                                 "Improv.anomaly.osmo"))

It is possible that during the loading of the data set, certain variables are not cast correctly. We check that the variable types are all correct according to the specified types in binary_names and categorical_names.

Look at the Activation.HS.procedure variable. The treatment TXA is generally given to prevent the occurrence of hemorrhagic shock, a condition which is often fatal. Let us see how is the treatment distributed w.r.t. this variable.

res.table <- table(df.na$W==1, df.na$Activation.HS.procedure, 
                   dnn = c("Treated","Activation.HS.procedure"),
                   useNA = "ifany")
print(res.table)
##        Activation.HS.procedure
## Treated FALSE TRUE <NA>
##   FALSE  6939  345  281
##   TRUE    263  392   28

First we compute the unadjusted ATE, ignoring the confounders.

ate_raw <- mean(Y[which(as.logical(W))]) - mean(Y[which(!as.logical(W))])

Let us also assess how the outcome is distributed with respect to the treatment group.

if (length(unique(Y))==2){
  res.table <- table(as.factor(W), as.factor(Y), 
                     dnn = c("W","Y"))
  prop.table(res.table)
} else {
  ggplot(data.frame(W=as.factor(W), Y=Y)) + 
    geom_histogram(aes(Y, color=W), fill = "white")
}
##    Y
## W            0          1
##   0 0.75630456 0.16088749
##   1 0.04449564 0.03831232

Among the treated and the control most patients survived. But proportionally, more patients died in the treated group than in the control group. Strong indication for confounding factors that need to be controlled for.

From these two variables we can compute the following observed probabilities:

sprintf("Average outcome: %2.1f", mean(Y))
## [1] "Average outcome: 0.2"
sprintf("P(Treatment=1) : %.2f", mean(W==1))
## [1] "P(Treatment=1) : 0.08"
sprintf("E(Y | Treatment=1) : %.2f", mean(Y*(W==1))/mean(W==1))
## [1] "E(Y | Treatment=1) : 0.46"
sprintf("E(Y | Treatment=0) : %.2f", mean(Y*(W==0))/mean(W==0))
## [1] "E(Y | Treatment=0) : 0.18"
sprintf("ATE without adjustment : %.4f", ate_raw)
## [1] "ATE without adjustment : 0.2873"

These uncorrected empirical quantities suggest that the treatment has a negative impact on the chances of survival.

However, treated patients tend to have lower Glasgow.initial - and are therefore potentially more severe since according to the practitioners this covariate is a predictor for the patient’s severity - which highly suggests that the treatment is not given uniformely at random.

ggplot(data = data.frame(X.na, W=as.factor(W))) +
  geom_density(aes(x=GCS.init, color = W, fill = W), alpha = .2) 

2 Imputation

We impute the data with two different methods (iterative FAMD, mice).

2.1 FAMD

Note that for this approach, an alternative would be to go back to the original dataset, instead of focusing just on the patients with TBI and/or AIS.head >= 2 to impute the missing data. Even though we will only keep the patients with TBI and/or AIS.head>=2 while running the causal analysis, including the other patients at this stage can allow the imputation of missing data to be more informed by adding additional observations for comparison purposes.

Let’s look at the matrix plot to identify if there are any variables that tend to be missing together or if we can detect any clear violation of the missing completely at random hypothesis:

matrixplot(df.na, sortby = 1, las=2, cex.axis = 0.3)

## 
## Click in a column to sort by the corresponding variable.
## To regain use of the VIM GUI and the R console, click outside the plot region.

We can also run MCA on the missing and non missing entries to double check our conclusion and to check if other groups of variables tend to be missing together:

data_miss <- data.frame(is.na(df.na))
data_miss <- apply(X=data_miss, FUN=function(x) if(x) "m" else "o", MARGIN=c(1,2))
res.mca <- MCA(data_miss, graph = FALSE)
plot(res.mca, invis = "ind", title = "MCA graph of the categories", cex =0.5)

We now perform a single imputation with a (regularized) iterative Factorial Analysis for Mixed Data model, which will allow imputation to take into account similarities between both individuals and relationships between variables. See: https://arxiv.org/pdf/1301.4797.pdf

First we find the optimal number of dimensions for the FAMD by cross-validation

Attention: this computation might take a long time. Set eval=F in the chunk header if you want to skip this part.

if (file.exists(paste0(rdata.dir, "traumabase_tbi_ncomp_famd.RData"))) {
  load(paste0(rdata.dir, "traumabase_tbi_ncomp_famd.RData"))
  plot(1:length(ncomp_famd$criterion), ncomp_famd$criterion, xlab = "nb dim", ylab = "MSEP")
} else {
  ncomp_famd <- estim_ncpFAMD(df.na,
  ncp.min = 1,
  ncp.max=15,
  method= "Regularized",
  method.cv = "Kfold",
  nbsim=10,
  verbose = TRUE)

  save(ncomp_famd,
       file = paste0(rdata.dir, "traumabase_tbi_ncomp_famd.RData"))
}

By cross-validation we find ncp=6 to be the optimal number of components.

if (file.exists(paste0(rdata.dir, "traumabase_tbi_", length(confounder_names), 
                       "confounders_", length(covariate_names), 
                       "variables_imputed_famd.RData"))){
  load(file=paste0(rdata.dir, "traumabase_tbi_", length(confounder_names), 
                       "confounders_", length(covariate_names), 
                       "variables_imputed_famd.RData"))
} else {
  ncp <- 6
  df.imp.pc <- get_FAMD(df.na, seed = 0, ncp=ncp, Method="Regularized", 
                        scale=T, threshold = 1e-06)
  
  save(df.imp.pc, file=paste0(rdata.dir, "traumabase_tbi_", 
                              length(confounder_names), 
                              "confounders_", length(covariate_names), 
                              "variables_imputed_famd.RData"))

}

It is possible that during the loading of the imputed data set, certain variables are not cast correctly. We check that the variable types are all correct according to the specified types in binary_names and categorical_names.

2.2 MICE

We also prepare a multiple imputation analysis, using the mice package. We impute the data m=5 times.

if (file.exists(paste0(rdata.dir, "traumabase_tbi_", length(confounder_names), 
                       "confounders_", length(covariate_names), 
                       "variables_imputed_mice.RData"))) {
  load(file=paste0(rdata.dir, "traumabase_tbi_", length(confounder_names), 
                       "confounders_", length(covariate_names), 
                       "variables_imputed_mice.RData"))
} else {
  df.imp.mice <- get_MICE(df.na, seed=0, m=5)
  save(df.imp.mice, file=paste0(rdata.dir, "traumabase_tbi_", 
                                length(confounder_names), "confounders_", 
                                length(covariate_names), 
                                "variables_imputed_mice.RData"))
}

Again we check and correct the variable types.

3 ATE estimation

We are interested in estimating the causal effect of TXA on the mortality of TBI patients.

Although most of the patients receive multiple treatments (in pre-hospital and hospital phase) and we have records of these other treatments (for instance osmotherapy, fibrinogene, etc.) we are only considering the effect of TXA on in-hospital mortality.

3.1 IPW

3.1.1 On imputed data

## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(only_treatment_pred_names)` instead of `only_treatment_pred_names` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(confounder_names)` instead of `confounder_names` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
IPW estimations
Estimand Imputation.method PS.estimation Mask ATE.normalized STD.ATE.normalized ATE.unnormalized
ATE pc.imp glm FALSE 0.29 0.10 0.25
ATE pc.imp grf.ate FALSE 0.17 0.04 0.09
ATE mice glm FALSE 0.27 0.11 0.23
ATE mice grf.ate FALSE 0.17 0.04 0.07
ATT pc.imp glm FALSE 0.30 0.03 0.00
ATT pc.imp grf.ate FALSE 0.10 0.03 0.01
ATT mice glm FALSE 0.12 0.05 0.00
ATT mice grf.ate FALSE 0.11 0.03 0.01
ATC pc.imp glm FALSE 0.29 0.11 0.25
ATC pc.imp grf.ate FALSE 0.18 0.05 0.08
ATC mice glm FALSE 0.29 0.12 0.23
ATC mice grf.ate FALSE 0.17 0.05 0.06
ATE_overlap pc.imp glm FALSE 0.11 0.03 0.01
ATE_overlap pc.imp grf.ate FALSE 0.11 0.03 0.01
ATE_overlap mice glm FALSE 0.12 0.03 0.01
ATE_overlap mice grf.ate FALSE 0.12 0.03 0.01
ATE pc.imp glm TRUE 0.31 0.13 0.27
ATE pc.imp grf.ate TRUE 0.17 0.04 0.08
ATE mice glm TRUE 0.28 0.13 0.24
ATE mice grf.ate TRUE 0.17 0.04 0.06
ATT pc.imp glm TRUE 0.28 0.04 0.00
ATT pc.imp grf.ate TRUE 0.10 0.03 0.01
ATT mice glm TRUE 0.12 0.06 0.00
ATT mice grf.ate TRUE 0.11 0.03 0.01
ATC pc.imp glm TRUE 0.31 0.14 0.27
ATC pc.imp grf.ate TRUE 0.18 0.05 0.07
ATC mice glm TRUE 0.30 0.14 0.25
ATC mice grf.ate TRUE 0.17 0.05 0.05
ATE_overlap pc.imp glm TRUE 0.11 0.03 0.01
ATE_overlap pc.imp grf.ate TRUE 0.11 0.03 0.01
ATE_overlap mice glm TRUE 0.11 0.03 0.00
ATE_overlap mice grf.ate TRUE 0.12 0.03 0.01

3.1.2 MIA+grf

ipw ATE estimations on mia covariates
Estimand Imputation.method PS.estimation Mask ATE.normalized STD.ATE.normalized ATE.unnormalized
ATE mia grf.ate TRUE 0.17 0.05 0.08
ATT mia grf.ate TRUE 0.10 0.03 0.01
ATC mia grf.ate TRUE 0.17 0.05 0.07
ATE_overlap mia grf.ate TRUE 0.11 0.03 0.01

3.2 DR

3.2.1 On imputed data

## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(confounder_names)` instead of `confounder_names` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(only_treatment_pred_names)` instead of `only_treatment_pred_names` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(only_outcome_pred_names)` instead of `only_outcome_pred_names` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(cluster_names)` instead of `cluster_names` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
DR estimations
Estimand Imputation.method PS.estimation Mask ATE STD
ATE pc.imp glm.grf FALSE 0.15 0.07
ATE pc.imp grf.ate FALSE 0.05 0.04
ATE mice glm.grf FALSE 0.10 0.05
ATE mice grf.ate FALSE 0.03 0.04
ATT pc.imp glm.grf FALSE 0.14 0.11
ATT pc.imp grf.ate FALSE 0.04 0.03
ATT mice glm.grf FALSE 0.03 0.03
ATT mice grf.ate FALSE 0.05 0.03
ATC pc.imp glm.grf FALSE 0.13 0.07
ATC pc.imp grf.ate FALSE 0.05 0.04
ATC mice glm.grf FALSE 0.10 0.06
ATC mice grf.ate FALSE 0.02 0.04
ATE_overlap pc.imp glm.grf FALSE 0.04 0.02
ATE_overlap pc.imp grf.ate FALSE 0.04 0.03
ATE_overlap mice glm.grf FALSE 0.04 0.02
ATE_overlap mice grf.ate FALSE 0.04 0.03
ATE pc.imp glm.grf TRUE 0.16 0.08
ATE pc.imp grf.ate TRUE 0.04 0.04
ATE mice glm.grf TRUE 0.11 0.06
ATE mice grf.ate TRUE 0.03 0.04
ATT pc.imp glm.grf TRUE 0.12 0.10
ATT pc.imp grf.ate TRUE 0.04 0.03
ATT mice glm.grf TRUE 0.02 0.03
ATT mice grf.ate TRUE 0.05 0.03
ATC pc.imp glm.grf TRUE 0.15 0.08
ATC pc.imp grf.ate TRUE 0.05 0.04
ATC mice glm.grf TRUE 0.12 0.07
ATC mice grf.ate TRUE 0.03 0.04
ATE_overlap pc.imp glm.grf TRUE 0.03 0.02
ATE_overlap pc.imp grf.ate TRUE 0.04 0.03
ATE_overlap mice glm.grf TRUE 0.03 0.02
ATE_overlap mice grf.ate TRUE 0.05 0.03

3.2.2 MIA+grf

DR ATE estimations on mia covariates
Estimand Imputation.method PS.estimation Mask ATE STD
ATE mia grf.ate FALSE 0.06 0.04
ATT mia grf.ate FALSE 0.04 0.03
ATC mia grf.ate FALSE 0.06 0.05
ATE_overlap mia grf.ate FALSE 0.04 0.03
ATE mia grf.ate TRUE 0.06 0.04
ATT mia grf.ate TRUE 0.04 0.03
ATC mia grf.ate TRUE 0.07 0.05
ATE_overlap mia grf.ate TRUE 0.04 0.03

4 Results

4.1 Plots

We will plot the results for the ATE estimation. Alternatively, you can also plot the results for the other estimands (in the call of the function plot_treatment_effect, set estimand to ATT, ATC or ATE_overlap).

4.1.1 ATE

## [1] "GLM"

## [1] "GRF"

4.1.2 ATT

4.1.3 ATC

4.1.4 ATE_overlap

5 Appendix

5.1 Balance plots

The standardized mean differences (SMD) higher than some threshold th, for instance th=0.1, indicate that the covariate distributions in the two groups differ: the treatment is not given uniformly at random. This explains the need for some adjustment or balancing in order to perform a causal analysis of the treatment on a certain outcome.

Note: small SMDs do not necessarily indicate balanced treatment groups, indeed the distributions can differ on other quantities than the first order.

We use propensity scores estimated via the function regression_forest of the grf package.

5.1.1 On X.imp.pc

We can also use the previously computed weights to look at the balance of the response pattern.

incomplete_confounders <- confounder_names[which(sapply(X.na[,confounder_names], 
                                                        function(x) sum(is.na(x))>0))]
R <- data.frame(is.na(X.na[,incomplete_confounders]))
colnames(R) <- paste(incomplete_confounders, "_NA", sep="")
balance <- bal.tab(R, treat=W, estimand="ATE", weights=weights, method = "weighting", un=TRUE)

love.plot(x=balance, stat="mean.diffs", abs=TRUE, var.order="unadjusted", 
          threshold=0.1, cex=0.8, shapes=c("circle", "triangle"), colors=c(viridis::viridis(10)[3], viridis::viridis(10)[9]),)

bal.plot(x=data.frame(treat=as.logical(W), ps=w.hat, weights=weights), var.name="ps", which="both",
         treat=as.logical(W),
         weights=weights,
         type="histogram", mirror=TRUE)

5.1.2 On X.imp.mice

5.1.3 On X.na using MIA

## Warning: Missing values exist in the covariates. Displayed values omit these
## observations.